@jfy133
Deoxyribonucleic acid (/diːˈɒksɪˌraɪboʊnjuːˌkliːɪk, -ˌkleɪ-/; DNA) is a molecule composed of two polynucleotide chains that coil around each other to form a double helix carrying genetic instructions for the development, functioning, growth and reproduction of all known organisms and many viruses. - Wikipedia
Cytosine, Thymine, Guanine & Adenine
C with G (think: CGI)
A with T (think: AT-AT walker)
C on one strand, G on the other (or vice versa)
A on one strand, T on the other (or vice versa)
C, get new G (etc.)
Converting the chemical nucleotides of a DNA molecule to ACTG on your computer screen
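Once the bases are letters on a screen, the pairing rule (C with G, A with T) can be applied to text. The following is a minimal Python sketch, not part of the original slides; the function names `complement` and `reverse_complement` are illustrative, and `N` is assumed to stand for an uncalled base that pairs with itself.

```python
# Pairing rule from the slides: C pairs with G, A pairs with T.
# 'N' (uncalled base) is assumed to map to itself.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of a strand (same left-to-right order)."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("ACTG"))          # TGAC
print(reverse_complement("ACTG"))  # CAGT
```

Applying `complement` twice gives back the original strand, which is the "C on one strand, G on the other (or vice versa)" idea in code form.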
Estevezj, CC BY-SA 3.0 via Wikimedia Commons
Not really ‘next’ anymore, consider it more ‘second’ generation (see: Nanopore)
Market leader:
Konrad Förstner, CC0, via Wikimedia Commons
(Others: Roche 454, PacBio, IonTorrent etc.)
i.e. to a strand, attach a complementary fluorophore-modified nucleotide, (normally) one colour per base
A
G
T
C
Fire mah lazer, and take a picture! Rinse and repeat!
On a ‘flow cell’
Bronner et al. (2013) Current Protocols in Human Genetics, DOI: (10.1002/0471142905.hg1802s79)
But how do you get your DNA to attach to the lawn
(and not get lost)?
AATGATACGGCGACCACCACaccgacaaCCCTACACGACGCTCTTCCGATCTXXXXXXAGCACACGTCTGAACTCCAGTCACgacactaCCGTCTTCTGCTTG
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTACTATGCCGCTGGTGGTGtggctgttGGGATGTGCTGCGAGAAGGCTAGAXXXXXXTCGTGTGCAGACTTGAGGTCAGTGctgtgatGGCAGAAGACGAAC
[Adapter & Index Primer] [Index] [Target primer] [Target] [Target primer] [Index] [Adapter & Index Primer]
Once bound, fluorescence of one molecule is not enough…
DMLapato, CC BY-SA 4.0, via Wikimedia Commons
Abizar Lakdawalla, CC BY 3.0, via https://openlab.citytech.cuny.edu/
EMBL-EBI Training, CC BY-SA 4.0, via https://www.ebi.ac.uk/training/
© 2021 Illumina, Inc. All rights reserved. Used here for training purposes only.
Special software (e.g. bcl2fastq):
For each location on the flow cell (cluster):
Group each recorded sequence (a ‘read’) with the other reads that carry the same index
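The grouping step can be pictured as sorting reads into buckets keyed by their index. The sketch below is a toy Python illustration only: it is not how bcl2fastq is implemented, the function name `demultiplex` is hypothetical, and the index and read sequences are made up.

```python
from collections import defaultdict


def demultiplex(reads):
    """Group reads by index.

    `reads` is an iterable of (index_sequence, read_sequence) pairs;
    the result maps each index to the list of reads that carried it.
    """
    by_index = defaultdict(list)
    for index, sequence in reads:
        by_index[index].append(sequence)
    return dict(by_index)


# Made-up example data: two samples, identified by two different indices.
reads = [
    ("ACGTAC", "TTGACC"),
    ("GGATCC", "CCATGA"),
    ("ACGTAC", "AAGCTT"),
]
groups = demultiplex(reads)
print(groups["ACGTAC"])  # ['TTGACC', 'AAGCTT']
```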
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity. - Wikipedia
Example
@K00233:37:HGHLYBBXX:3:1101:2646:1121 1:N:0:NACGCATC+NGCTAATG
NCGCATGAGCCGCCTGTATCAGGCGCTGATCGAACCGGGCATTGCAGTTGGGATAGATCGGAAGAGCACACGTCTG
+
#A7F<<AA<JFJFJJJJJJFFJJJJJJJAFFJFJJJJJJJFJAFFFJAJFJJ<FJJJJJFFF<FFA--FFFJJJJJ
@K00233:37:HGHLYBBXX:3:1101:4655:1121 1:N:0:NACGCATC+NGCTAATG
NATGCATGACAGGAGGTGAGGGCATTTTCCAGATTTTCAGGCTGCGACCTTGAGCATCTTTCGCCGCTTCCAGCAC
+
#AA-<FFFF7JFF7JJJJJFJJ<JJJJJA7FJJJJJJJFF<JFF<J7-<FJJJJFJFFJJJAAAAFFJJ--AJAJJ
@ <read id, e.g. machine ID, location on flowcell> <extra metadata>
<DNA sequence; Note: N = base couldn't be called!>
+ <a separator>
<base quality scores for each nucleotide in sequence>
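The four-line layout described above can be read with a few lines of Python. This is a minimal sketch that assumes every record occupies exactly four lines (no wrapped sequences); the function name `read_fastq` and the tiny example record are illustrative and not taken from the slides.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("record does not start with '@': " + header)
        if len(sequence) != len(qualities):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, qualities


# A made-up single-record example held in memory instead of a file.
example = io.StringIO("@read1 1:N:0:ACGT\nNACGT\n+\n#AAFJ\n")
records = list(read_fastq(example))
print(records)  # [('read1 1:N:0:ACGT', 'NACGT', '#AAFJ')]
```

For real data, an established parser is usually the safer choice; the sketch only shows how the four lines map onto ID, sequence, separator and qualities.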
Quality score
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
0.2......................26...31........41
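Each quality character maps to a number via its position in the scale above: `!` is 0 and `J` is 41, which corresponds to subtracting 33 from the character's ASCII code. The Python sketch below assumes this Phred+33 encoding and uses the standard Phred relation, error probability = 10^(-Q/10); the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(character) - offset for character in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


scores = phred_scores("#A7F<J")
print(scores)                 # [2, 32, 22, 37, 27, 41]
print(error_probability(20))  # a Phred score of 20 means a 1-in-100 chance of error
```

The `#` at the start of both example reads therefore decodes to a score of 2, which fits the `N` (base couldn't be called) in the first position of each sequence.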
What is the command line?
A command-line interface (CLI) processes commands to a computer program in the form of lines of text. - Wikipedia
A command prompt (or just prompt) is a sequence of (one or more) characters used in a command-line interface to indicate readiness to accept commands. - Wikipedia
<username>@<machine_name>:<current_directory>$
$ is where you type your command
Type in everything after the prompt, and press enter/return (⏎) on your keyboard
Hello world!
(-h or --help)
What is in the room (directory)?
Let’s go into the directory, and see what’s in there!
How to go back?
We will run the nf-core/eager pipeline.
nf-core/eager is a scalable and reproducible bioinformatics best-practise processing pipeline for genomic NGS sequencing data, with a focus on ancient DNA (aDNA) data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes.
Pipeline (software): a chain of data-processing processes or other software entities
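The "chain of processes" idea can be shown with a toy Python example in which the output of each step becomes the input of the next. This is only an illustration of the definition: the step names `drop_uncalled` and `drop_short` are hypothetical and are not steps of nf-core/eager, which is written in Nextflow.

```python
from functools import reduce


def drop_uncalled(reads):
    """Hypothetical step: discard reads containing an uncalled base (N)."""
    return [read for read in reads if "N" not in read]


def drop_short(reads, min_length=5):
    """Hypothetical step: discard reads shorter than min_length."""
    return [read for read in reads if len(read) >= min_length]


def run_pipeline(data, steps):
    """Feed data through each step in order, passing each output to the next step."""
    return reduce(lambda current, step: step(current), steps, data)


result = run_pipeline(["ACGTAC", "NACGTA", "ACG"], [drop_uncalled, drop_short])
print(result)  # ['ACGTAC']
```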
What is the output?